Instructor: Aleszu Bajak
This O’Reilly course will introduce participants to the techniques and applications of text mining and sentiment analysis by training them in easy-to-use open-source tools and scalable, replicable methodologies that will make them stronger data scientists and more thoughtful communicators.
Using RStudio and several engaging and topical datasets sourced from politics, social science, and social media, this course will introduce techniques for collecting, wrangling, mining and analyzing text data.
Participants will also derive insights from their textual analysis and communicate them using a set of data visualization methods. Techniques covered include n-gram analysis, sentiment analysis, and part-of-speech analysis.
By the end of this live, hands-on, online course, you’ll understand:
And you’ll be able to:
Ideally, participants will have the latest versions of R and RStudio installed, along with the tidytext and tidyverse packages. To access all R scripts, participants should download this GitHub repository and set it as their working directory in RStudio.
This course can also be accessed on RStudio Cloud here, though a free account is required.
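For a local setup, the steps above might look roughly like the following sketch. It assumes the packages are installed from CRAN, and the repository path is a placeholder rather than the actual location of the course materials.

```r
# Install any missing packages from CRAN (only needed once per machine)
required <- c("tidyverse", "tidytext")
to_install <- setdiff(required, rownames(installed.packages()))
if (length(to_install) > 0) install.packages(to_install)

library(tidyverse)
library(tidytext)

# Point R at the downloaded repository.
# The path below is a placeholder; replace it with your own and uncomment.
# setwd("path/to/downloaded-repository")
getwd()  # confirm the current working directory
```

In RStudio, the same result can be had through Session > Set Working Directory.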
Text mining is all about making sense of text. That could mean counting the frequency of specific words, understanding the overall sentiment of a document, or applying statistical techniques to draw big-picture conclusions from a corpus. Whether one is analyzing social media posts, customer reviews or news articles, these techniques can be essential to understanding and deriving meaningful insights.
Note: Though there are several ways to mine text and perform sentiment analysis in R (with packages such as tm, quanteda, udpipe, and sentimentr), this course uses R’s tidytext package, developed by Julia Silge and David Robinson, along with several tidy tools from the tidyverse collection of packages.
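To make word counting and document-level sentiment concrete, here is a minimal sketch using tidytext and dplyr. The two-sentence corpus is invented for illustration and is not one of the course datasets; the Bing lexicon is used only because it ships with tidytext, not because the course necessarily relies on it.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus, invented for this example
docs <- tibble(
  doc_id = c(1, 2),
  text = c(
    "The new policy is a great success and voters love it",
    "Critics call the policy a terrible failure and a bad idea"
  )
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
words <- docs %>%
  unnest_tokens(word, text)

# Word frequencies after dropping common stop words such as "the" and "and"
word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Number of positive and negative words per document, based on the Bing lexicon
doc_sentiment <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(doc_id, sentiment)

print(word_counts)
print(doc_sentiment)
```

The same pipeline scales from two sentences to a full corpus, since each step operates on a one-word-per-row data frame.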
BuzzFeed’s analysis of U.S. State of the Union speeches over time is a great example of text analysis. As an added bonus, journalist Peter Aldhous shared all his data and open-sourced his methodology as an R Markdown document. Related: New York Times science graphics editor Jonathan Corum also has a cool State of the Union visualization tool on his website.
The New York Times’ Mueller Report citations article is another example of text analysis in mainstream media, showing which of Trump’s associates appeared in the report and how often. Check out my Storybench tutorial, which includes R code for mining the Mueller Report for specific keywords.
FiveThirtyEight published an analysis tallying the instances of the name “Trump” in 2020 candidate messaging. The dataset consisted of emails that 2020 candidates sent to subscribers.
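A keyword tally of this kind can be sketched in a few lines of R with dplyr and stringr. The messages and candidate labels below are invented placeholders, not FiveThirtyEight’s data, and this is not a reconstruction of their method.

```r
library(dplyr)
library(stringr)

# Hypothetical messages standing in for candidate emails (invented for illustration)
emails <- tibble(
  candidate = c("Candidate A", "Candidate A", "Candidate B"),
  text = c(
    "Trump's budget hurts families. We can beat Trump together.",
    "Join our town hall on health care this Saturday.",
    "Chip in today to help us take on Trump."
  )
)

# Count whole-word, case-insensitive matches per message, then total by candidate
keyword_tally <- emails %>%
  mutate(mentions = str_count(text, regex("\\btrump\\b", ignore_case = TRUE))) %>%
  group_by(candidate) %>%
  summarise(total_mentions = sum(mentions), .groups = "drop")

print(keyword_tally)
```

The word-boundary markers keep the pattern from matching inside longer words such as “trumpet”, while still counting possessives like “Trump’s”.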
The Boston Globe’s Arresting Words investigation visualized transcripts of police arrests to isolate the top words uttered by those being hauled in.
Crimson Hexagon, recently acquired by Brandwatch, delivers “actionable social insights for the enterprise,” for example, how Under Armour clothing or 5-hour Energy drink is being discussed online.